
FIGURE 5.14
Attention-head view for (a) the full-precision BERT, (b) the fully binarized BERT baseline, and (c) BiBERT for the same input. BiBERT with Bi-Attention behaves similarly to the full-precision model, while the baseline suffers from indistinguishable attention caused by information degradation.

Since binarized representations have limited capabilities, the ideal binarized representation should preserve the given full-precision counterparts as much as possible, which means the mutual information between the binarized and full-precision representations should be maximized. When the deterministic sign function is applied to binarize BERT, this goal is equivalent to maximizing the information entropy H(B) of the binarized representation B [171], which is defined as

$$H(\mathbf{B}) = -\sum_{B} p(B)\,\log p(B), \qquad (5.26)$$

where B ∈ {−1, 1} is a random variable sampled from the binarized representation $\mathbf{B}$ with probability mass function p. Therefore, the information entropy of the binarized representation should be maximized to better preserve the full-precision counterparts and let the attention mechanism function well.
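To make Eq. (5.26) concrete, below is a minimal PyTorch-style sketch (the helper name binary_entropy and the tensor shapes are illustrative, not part of BiBERT) that estimates the information entropy of a sign-binarized tensor and shows why a roughly balanced split between −1 and +1 maximizes it, while an all-positive input collapses the entropy to zero.

```python
import torch

def binary_entropy(b: torch.Tensor) -> float:
    """Estimate the information entropy H(B) of a sign-binarized tensor with
    values in {-1, +1}, following Eq. (5.26)."""
    p_pos = (b > 0).float().mean()             # empirical p(B = +1)
    probs = torch.stack([p_pos, 1.0 - p_pos])  # probability mass over {+1, -1}
    probs = probs[probs > 0]                   # drop zero-probability outcomes to avoid log(0)
    return float(-(probs * probs.log()).sum())

# A roughly zero-mean input keeps +1 and -1 balanced after sign(),
# so the entropy approaches its maximum, log 2 ≈ 0.693.
x = torch.randn(4, 128)
print(binary_entropy(torch.sign(x)))           # close to 0.693

# A strictly positive input (e.g., softmax outputs) collapses to all +1,
# so the entropy degenerates to 0.
a = torch.softmax(torch.randn(4, 128), dim=-1)
print(binary_entropy(torch.sign(a)))           # 0.0
```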

As for the attention structure in full-precision BERT, the normalized attention weight obtained by softmax is essential. However, directly applying the binarization function causes a complete loss of information in the binarized attention weight. Specifically, since softmax(A) is regarded as following a probability distribution, all of its elements are positive, so the elements of the sign-binarized attention weight $\mathbf{B}^s_{\mathrm{A}}$ are all quantized to 1 (Fig. 5.14(b)) and the information entropy $H(\mathbf{B}^s_{\mathrm{A}})$ degenerates to 0. A common measure to alleviate this information degradation is to shift the distribution of the input tensors before applying the sign function, which is formulated as

$$\hat{\mathbf{B}}^s_{\mathrm{A}} = \mathrm{sign}\left(\mathrm{softmax}(\mathbf{A}) - \tau\right), \qquad (5.27)$$

where the shift parameter τ, also regarded as the threshold of binarization, is expected to maximize the entropy of the binarized $\hat{\mathbf{B}}^s_{\mathrm{A}}$ and is fixed during inference. Moreover, the attention weight obtained by the sign function is binarized to {−1, 1}, while the original attention weight lies in the normalized range [0, 1]. Negative attention weights in the binarized architecture are contrary to the intuition behind the existing attention mechanism and are also empirically shown to be harmful to the attention structure.
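As a rough illustration of the shift-and-sign scheme in Eq. (5.27), the sketch below assumes PyTorch; the function name and the choice of τ as the uniform-attention value 1/N are purely illustrative and are not the authors' entropy-maximizing rule. It shows how shifting by a threshold before sign() restores a mix of +1 and −1 entries, while also exposing the negative values that the text criticizes.

```python
import torch

def shifted_sign_binarize(attn_scores: torch.Tensor, tau: float) -> torch.Tensor:
    """Eq. (5.27): shift the normalized attention weights by a threshold tau,
    then apply sign(), mapping each weight to {-1, +1}."""
    probs = torch.softmax(attn_scores, dim=-1)   # normalized weights in (0, 1)
    return torch.sign(probs - tau)

scores = torch.randn(2, 4, 8, 8)                 # (batch, heads, queries, keys)
tau = 1.0 / scores.shape[-1]                     # illustrative threshold: uniform weight 1/N
b_hat = shifted_sign_binarize(scores, tau)
print(b_hat.unique())                            # typically tensor([-1., 1.]): negatives remain
```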

To mitigate the information degradation caused by binarization in the attention mechanism, the authors introduced an efficient Bi-Attention structure for fully binarized BERT, which maximizes the information entropy of binarized representations statistically and applies bitwise operations for fast inference. In detail, they proposed to binarize the attention weight to Boolean values, with the design driven by information entropy maximization. In Bi-Attention, the bool function is leveraged to binarize the attention score A, which is defined as

$$\mathrm{bool}(x) = \begin{cases} 1, & \text{if } x \ge 0, \\ 0, & \text{otherwise}, \end{cases} \qquad (5.28)$$
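The following is a minimal sketch of the bool binarization in Eq. (5.28), assuming PyTorch and an illustrative function name; it maps the attention score elementwise to {0, 1}, so the binarized attention weight stays non-negative, in contrast to the sign-based variant above.

```python
import torch

def bool_binarize(attn_scores: torch.Tensor) -> torch.Tensor:
    """Eq. (5.28): bool(x) = 1 if x >= 0 else 0, applied elementwise to the
    attention score A; the result contains only the values {0, 1}."""
    return (attn_scores >= 0).to(attn_scores.dtype)

scores = torch.randn(2, 4, 8, 8)      # (batch, heads, queries, keys)
b_a = bool_binarize(scores)
print(b_a.unique())                   # tensor([0., 1.])
```

Because both operands of the binarized attention-value product then take only two values, the product can be carried out with bitwise operations, which is the fast-inference property the text attributes to Bi-Attention.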